For this analysis we will be using a dataset that contains 6497 wines, with 11 variables on the chemical properties of each wine as well as a variable for the wine’s color (Red or White). It also includes an output variable called quality, for which at least 3 experts rated the wine on a scale from 0 to 10. There are 1599 red and 4898 white wines in the data set.
The main question we are trying to answer with this analysis
## 'data.frame': 6497 obs. of 13 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ color : chr "White" "White" "White" "White" ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.800
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 3.000
## Mean : 7.215 Mean :0.3397 Mean :0.3186 Mean : 5.443
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.: 8.100
## Max. :15.900 Max. :1.5800 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 1.00 Min. : 6.0
## 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0
## Median :0.04700 Median : 29.00 Median :118.0
## Mean :0.05603 Mean : 30.53 Mean :115.7
## 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0
## Max. :0.61100 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9923 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50
## Median :0.9949 Median :3.210 Median :0.5100 Median :10.30
## Mean :0.9947 Mean :3.219 Mean :0.5313 Mean :10.49
## 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.30
## Max. :1.0390 Max. :4.010 Max. :2.0000 Max. :14.90
## quality color
## Min. :3.000 Length:6497
## 1st Qu.:5.000 Class :character
## Median :6.000 Mode :character
## Mean :5.818
## 3rd Qu.:6.000
## Max. :9.000
We loaded the tables above to examine some of the patterns in the data. Since quality is going to be our dependent variable for the analysis, we decided to plot a simple histogram of the occurrences of each quality score.
We decided to do a side-by-side comparison of quality between the two colors of wine. It looks like red wines have a mode of 5 versus a mode of 6 for white wines. Let us scale the histograms to percentages so we can better compare them.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The two histograms above are the same as the ones before but scaled to a percentage of total observations by color, which makes the differences easier to see. It is now clear that red wine has a larger share of 5s than white wine. It appears about 80 percent of the scores for reds are 5 and 6, while white wine is more spread out and has more 7s.
We also summarized quality for each color (white first, then red, in the output above) so we could see the differences. Both have the same median (6), but the mean is slightly higher for white wines (5.878 versus 5.636).
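The per-color scaling described here can be sketched in base R with `prop.table()`. This is only an illustration on synthetic data: `wine_demo` and its values are made up, and the report's own plotting code is not shown, so this is not necessarily how the histograms were produced.

```r
# Synthetic stand-in for wine_df: only `quality` and `color` are needed here.
set.seed(1)
wine_demo <- data.frame(
  quality = c(sample(3:9, 300, replace = TRUE), sample(3:8, 100, replace = TRUE)),
  color   = rep(c("White", "Red"), times = c(300, 100))
)

# Counts of each quality score per color (rows = color, columns = quality).
counts <- table(wine_demo$color, wine_demo$quality)

# margin = 1 divides each row by its own total, so every color sums to 100%
# regardless of how many wines of that color there are.
pct_by_color <- 100 * prop.table(counts, margin = 1)

print(round(pct_by_color, 1))
```

Because each row is divided by its own total, the much larger number of white wines no longer dominates the comparison.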
Above is a frequency plot of white wine quality and red wine quality. We have again scaled the values by the number of wines of each color in our data set. It is similar to the previous histograms, but it allows us to put both colors on the same plot with the same scale. Looking at this, once again it seems like white wines are of slightly higher quality than reds, although the two curves follow each other fairly closely. Let's test whether, on average, white wines are rated higher than reds.
##
## Two Sample t-test
##
## data: subset(wine_df, color == "White")$quality and subset(wine_df, color == "Red")$quality
## t = 9.6856, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.2008028 Inf
## sample estimates:
## mean of x mean of y
## 5.877909 5.636023
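We ran a one-sided two-sample t-test. Note that the output header ("Two Sample t-test") and the whole-number degrees of freedom (df = 6495, which is 4898 + 1599 - 2) indicate the pooled-variance version rather than Welch's test. The null hypothesis is that red and white wines have the same mean quality; the alternative is that white wines have a higher mean quality.
The p-value for this test is very low (below 2.2e-16).
This means that we can reject the null hypothesis that red and white wines are rated the same and conclude that, on average, white wines are rated higher than reds.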
Note that this only applies to this dataset. Perhaps white wines have some chemical properties that differ from those of red wines and that cause this small difference.
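The output above is headed "Two Sample t-test" and reports a whole-number df, which is what `t.test()` prints when `var.equal = TRUE`. The sketch below contrasts that call with R's default Welch test. The scores are synthetic (group sizes and means are invented), so only the structure of the output is meant to carry over.

```r
set.seed(42)
# Synthetic scores, NOT the wine data: group sizes and means are made up.
white <- round(rnorm(500, mean = 5.9, sd = 0.9))
red   <- round(rnorm(200, mean = 5.6, sd = 0.8))

# Pooled-variance test: df = n1 + n2 - 2, printed as "Two Sample t-test".
pooled <- t.test(white, red, alternative = "greater", var.equal = TRUE)

# Welch test (R's default): printed as "Welch Two Sample t-test",
# usually with a fractional df.
welch <- t.test(white, red, alternative = "greater")

print(pooled$parameter)  # degrees of freedom of the pooled test
print(welch$parameter)   # degrees of freedom of the Welch test
```

With group sizes as unequal as 4898 and 1599, the Welch version is generally the safer choice, since it does not assume that both colors have the same variance.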
The plots above are histograms for all chemical properties of the wines. It appears that volatile.acidity, residual.sugar and alcohol are all right skewed (most values sit at the low end with a long tail toward high values), which agrees with the summary table, where the mean of each exceeds its median. Citric acid, pH and density look more normally distributed.
Above are attempted transformations of total.sulfur.dioxide. We tried a log10 scale, a square root scale and a cube root scale. Each transformation produced a more left-skewed distribution than the original.
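One way to put a number on "more left skewed" is the sample skewness (negative means left skew). The sketch below applies the same three transformations to a synthetic, roughly symmetric variable. The helper `skewness()` and the data `x` are illustrative assumptions, not part of the original analysis.

```r
# Sample skewness (third standardized moment):
# negative = left skew, positive = right skew.
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

set.seed(7)
# Synthetic, roughly symmetric positive values standing in for
# total.sulfur.dioxide (floored at 6 so that log10 is defined).
x <- pmax(rnorm(2000, mean = 115, sd = 50), 6)

skews <- c(
  raw       = skewness(x),
  log10     = skewness(log10(x)),
  sqrt      = skewness(sqrt(x)),
  cube_root = skewness(x^(1/3))
)
print(round(skews, 2))
```

Log and root transformations compress the upper tail, so they help with right-skewed variables. Applied to a variable that is already close to symmetric, they push the skewness negative, which is consistent with what the plots showed.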
Above are boxplots for all the chemical properties. These plots give us a general idea of what the data for each metric looks like and what to look at going forward. It looks like alcohol, density and total.sulfur.dioxide have data that is more concentrated around the median, with fewer outliers. fixed.acidity, sulphates, volatile.acidity and chlorides look like they have more outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.30 10.49 11.30 14.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 3.000 5.443 8.100 65.800
## Low Medium High
## 2120 2393 1984
## Low Medium High
## 2832 2813 852
## # A tibble: 3 x 2
## alcohol.level quality
## <fct> <dbl>
## 1 Low 5.46
## 2 Medium 5.96
## 3 High 6.57
## # A tibble: 3 x 2
## sugar.level quality
## <fct> <dbl>
## 1 Low 5.78
## 2 Medium 5.91
## 3 High 5.75
We created some categorical variables for sugar and alcohol levels by splitting them up into Low, Medium and High. The two plots above are stacked histograms for each of these variables.
You can see that the higher alcohol content wines in general have higher quality values (the blue on the histogram is concentrated at 6 and above), while the high residual sugar wines are more concentrated at quality 5 and 6.
You can also see this in the summary tables provided. Mean quality rises with alcohol level (5.46, 5.96 and 6.57 for Low, Medium and High), while it barely changes with sugar level (5.78, 5.91 and 5.75).
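A minimal sketch of how such a Low/Medium/High variable can be built with `cut()` and then summarized. The report does not show the cut points it used, so the tercile breaks here are an assumption, and `wine_demo` is synthetic data with an invented alcohol-quality relationship.

```r
set.seed(3)
# Synthetic stand-in for wine_df; the link between alcohol and quality is made up.
n <- 600
alcohol <- runif(n, min = 8, max = 14.9)
quality <- pmin(pmax(round(3 + 0.3 * alcohol + rnorm(n, sd = 0.8)), 3), 9)
wine_demo <- data.frame(alcohol, quality)

# Assumed cut points: terciles of alcohol. The report's actual breaks are unknown.
breaks <- quantile(wine_demo$alcohol, probs = c(0, 1/3, 2/3, 1))
wine_demo$alcohol.level <- cut(
  wine_demo$alcohol,
  breaks = breaks,
  labels = c("Low", "Medium", "High"),
  include.lowest = TRUE  # keep the minimum value in the "Low" bin
)

# Number of wines in each level.
print(table(wine_demo$alcohol.level))

# Mean quality within each level, analogous to the summary tables above.
level_means <- aggregate(quality ~ alcohol.level, data = wine_demo, FUN = mean)
print(level_means)
```

Without `include.lowest = TRUE`, the smallest observation would fall outside the first interval and become `NA`.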
There are 6497 wines in this data set. Originally there were two data sets, one for red and one for white wine, but they were combined and a variable was added to differentiate them.
There are 11 chemical properties of the wine each with a numeric value (see below.)
There is also an output variable (Quality) which is a score between 0 and 10 which is the median of at least 3 evaluations by wine experts.
There is also a variable that indicates the type of wine called color which can be Red or White.
Finally, two categorical variables, alcohol.level and sugar.level, were added based on the alcohol and residual sugar variables. These are both factors with 3 levels (Low, Medium and High).
alcohol.level
sugar.level
The main features of interest in our dataset are residual sugar, alcohol and quality. We want to use the combination of the 2 variables to see if we can create a model that determines the quality of the wine.
Along with residual sugar and alcohol, some of the other chemical properties could also be used for our analysis. The ones that pique my interest are pH and total sulfur dioxide, as I suspect those might also have an effect.
Yes, we created alcohol.level and sugar.level based on the alcohol and residual.sugar variables.
We attempted some transformations on total.sulfur.dioxide to see if we could get a more normal distribution. We tried log10, square root and cube root. Each one gave a more left-skewed plot.
##
## Pearson's product-moment correlation
##
## data: wine_df$residual.sugar and wine_df$alcohol
## t = -31.04, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3804069 -0.3380525
## sample estimates:
## cor
## -0.3594148
##
## Pearson's product-moment correlation
##
## data: wine_df$alcohol and wine_df$quality
## t = 39.97, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4245892 0.4636261
## sample estimates:
## cor
## 0.4443185
The first graph above is residual sugar vs alcohol content. Many values are concentrated at a low residual sugar level, and as residual sugar increases the alcohol content gets lower, indicating a negative correlation (r = -0.36).
The second graph is alcohol vs quality. It looks like quality rises as alcohol content rises, indicating a positive correlation between alcohol content and quality (r = 0.44).
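For reference, the pieces of the `cor.test()` output above can be pulled out programmatically, and the t statistic can be rebuilt from r and the sample size. The sketch uses synthetic data with an invented positive relationship, so the numbers it prints are not the report's.

```r
set.seed(11)
# Synthetic data with a built-in positive relationship; values are illustrative only.
alcohol <- runif(400, min = 8, max = 15)
quality <- round(2.5 + 0.3 * alcohol + rnorm(400, sd = 0.8))

ct <- cor.test(alcohol, quality, method = "pearson")

print(ct$estimate)   # sample correlation r
print(ct$conf.int)   # 95% confidence interval for r
print(ct$parameter)  # degrees of freedom, n - 2

# The t statistic depends only on r and n: t = r * sqrt(n - 2) / sqrt(1 - r^2).
r <- unname(ct$estimate)
t_manual <- r * sqrt(400 - 2) / sqrt(1 - r^2)
print(all.equal(t_manual, unname(ct$statistic)))
```

Plugging the report's r = 0.4443 and df = 6495 into the same formula gives t ≈ 39.97, matching the output above. It also shows why even a weak correlation can have a tiny p-value when n is this large.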
##
## Pearson's product-moment correlation
##
## data: wine_df$quality and wine_df$residual.sugar
## t = -2.9824, df = 6495, p-value = 0.002871
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06124221 -0.01267509
## sample estimates:
## cor
## -0.03698048
The first graph above is residual sugar vs quality. There is no strong correlation either way for these two attributes (r = -0.04).
In the next two graphs we plotted pH against two acidity metrics, citric.acid and fixed.acidity. It looks like there is a slightly negative correlation between pH and each of these metrics, which makes sense since lower pH levels are more acidic.
The first graph above shows sulphates vs quality. The second graph shows pH vs quality. There looks to be little correlation between them.
The third graph is 9 plots for each quality with pH vs sulphates. Looking at each of these plots it is hard to tell if either of these factor into quality.
The five plots above show quality against 5 different variables: free sulfur dioxide, chlorides, total sulfur dioxide, density and volatile acidity. Looking at the graphs, it is hard to tell if there are correlations. I am going to take a closer look at density and chlorides, as those appear to have some correlation with higher quality.
In the plots above we are looking at a bar graph of the median chlorides by quality and a scatter plot of density similar to the one we had previously. The median density does appear to go down as quality goes up. You can also tell from the bar graph that chlorides go down as quality goes up.
##
## Pearson's product-moment correlation
##
## data: wine_df$quality and wine_df$chlorides
## t = -16.508, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2238898 -0.1772134
## sample estimates:
## cor
## -0.2006655
The last bivariate graph is a facet wrap of alcohol vs chlorides for each quality score. If you look closely, you can see that as the quality gets higher the points shift toward the y axis (lower chlorides) and higher up (higher alcohol content).
After reviewing alcohol content and residual sugar I determined that there was a positive correlation between alcohol content and quality. I also discovered that pH is negatively correlated with the acidity fields. After reviewing the other features I found that chlorides and density have a small negative correlation with quality.
I observed that pH has a negative correlation with the acidity features. This makes sense because a lower pH indicates higher acidity.
The strongest relationships I found were alcohol content to quality, chlorides to quality, and density to quality.
The first plot is chlorides vs alcohol content with the colors set to quality. You can see that quality increases as chlorides get lower and alcohol content gets higher.
The plot above shows average quality by alcohol content for red and white wines. You can see clearly that the trend for both white and red wines is that higher alcohol content is correlated with higher quality.
The above is a ggpairs plot of all the variables we haven’t explored much. We know we want to look at chlorides and alcohol. Based on this graph, the variable with the largest correlation with quality is density, so we will explore density a bit more.
The above graph shows density vs chlorides. It looks like density and chlorides have a positive correlation (which is consistent with both having a negative correlation with quality). There is one plot for each of the 3 alcohol levels. You can see that as the alcohol level gets higher the plot shifts down and turns more green. This means that alcohol level is negatively correlated with density and chlorides. It also means that higher alcohol levels correlate with higher quality.
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wine_df)
## m2: lm(formula = I(quality) ~ I(alcohol) + chlorides, data = wine_df)
## m3: lm(formula = I(quality) ~ I(alcohol) + chlorides + density, data = wine_df)
## m4: lm(formula = I(quality) ~ I(alcohol) + density, data = wine_df)
##
## ==========================================================================
## m1 m2 m3 m4
## --------------------------------------------------------------------------
## (Intercept) 2.405*** 2.717*** -7.179 2.810
## (0.086) (0.094) (4.644) (4.512)
## I(alcohol) 0.325*** 0.308*** 0.324*** 0.325***
## (0.008) (0.008) (0.011) (0.011)
## chlorides -2.309*** -2.476***
## (0.285) (0.296)
## density 9.793* -0.399
## (4.595) (4.454)
## --------------------------------------------------------------------------
## R-squared 0.197 0.205 0.206 0.197
## adj. R-squared 0.197 0.205 0.206 0.197
## sigma 0.782 0.779 0.778 0.782
## F 1597.641 839.499 561.486 798.702
## p 0.000 0.000 0.000 0.000
## Log-likelihood -7623.404 -7590.806 -7588.534 -7623.400
## Deviance 3975.734 3936.038 3933.286 3975.729
## AIC 15252.809 15189.613 15187.068 15254.801
## BIC 15273.146 15216.729 15220.964 15281.917
## N 6497 6497 6497 6497
## ==========================================================================
I created 4 models: one with quality vs alcohol content (m1); one with quality vs alcohol content and chlorides (m2); one with quality vs alcohol content, chlorides and density (m3); and one with quality vs alcohol content and density (m4). The best model by R^2 combined all 3 variables (0.206), but density adds little compared to chlorides: adding density to m2 raises R^2 only from 0.205 to 0.206, and m4 has the same R^2 as m1 (0.197).
After reviewing some plots around alcohol content, chlorides and quality it is clear that alcohol content has the strongest correlation to quality.
I noticed that density and chlorides were positively correlated.
The model I created only explains 20.6% of the variance. Adding density improved the R^2 value by only 0.1 percentage points. This makes some sense since density was positively correlated with chlorides.
This model does not explain much of the variance in wine quality and would not be a good model for predicting quality.
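A hedged sketch of comparing nested linear models in base R. The data and coefficients below are synthetic and invented for illustration, and the report's own table appears to come from a different formatting function, so only the comparison logic is meant to carry over.

```r
set.seed(5)
# Synthetic predictors and response; the coefficients are invented.
n <- 500
alcohol   <- runif(n, 8, 15)
chlorides <- runif(n, 0.01, 0.2)
quality   <- 2.5 + 0.3 * alcohol - 2.5 * chlorides + rnorm(n, sd = 0.8)
demo <- data.frame(alcohol, chlorides, quality)

m1 <- lm(quality ~ alcohol, data = demo)
m2 <- lm(quality ~ alcohol + chlorides, data = demo)

# R-squared never decreases when a predictor is added, so also compare
# adjusted R-squared and AIC, both of which penalize the extra parameter.
comparison <- data.frame(
  model  = c("m1", "m2"),
  r2     = c(summary(m1)$r.squared, summary(m2)$r.squared),
  adj_r2 = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  aic    = c(AIC(m1), AIC(m2))
)
print(comparison)

# F-test for nested models: does chlorides explain significantly more variance?
nested_test <- anova(m1, m2)
print(nested_test)
```

In the report's table the criteria disagree slightly: AIC is lowest for m3 (15187.068), while BIC, which penalizes extra parameters more heavily, is lowest for m2 (15216.729). That fits the observation that density contributes very little.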
This was a plot of red wine vs white wine frequency by quality. I was able to scale the graph by the number of white and red wines and show that white wines on average are rated higher than reds.
It is the basis for the statistical test on quality vs color that I did in the univariate section.
The second plot is a plot of average quality vs chloride level. This plot confirmed the negative correlation between quality and chloride level and is the basis for why I added the chloride feature to my model in the multi-variate analysis section.
The last plot is where I bring all three variables I suspect have correlations with quality together and graph them in one image. You can clearly see that as alcohol level increases, so does the quality. You can also see the negative correlation that chlorides and density have with quality, as the graph shifts down as you move across each alcohol level.
Overall I was able to determine that on average white wines are rated higher quality than red wines. I was also able to determine that alcohol content correlates positively with quality, while chlorides and density correlate negatively with quality.
I was able to use different types of plots to show this as well as create a linear model that helps determine quality based on these variables.
It was a struggle to find features that help determine quality. A lot of the variables seemed to have very little impact or correlations. I tried to transform some of the variables but they still did not seem to correlate very well.
I think if I was to do further analysis I would try to transform more of the variables with square roots or research other transformations I could do. It would also be more interesting if the quality variable was not a median but was mean or if it was from 0 to 100 instead of 0 to 10.